block 4: observability & ops (version, health, metrics, access-log, log-format) #74

Merged: aksOps merged 8 commits into main from feat/block4-observability on Apr 24, 2026

@aksOps aksOps commented Apr 24, 2026

Summary

  • Add /api/version (public JSON with version/commit/build_date/go_version/deps) backed by a new internal/buildinfo package; Makefile LDFLAGS retargeted at internal/buildinfo.
  • Add /healthz (always-200 liveness) and /readyz (SQLite ping + LLM reach, 10 s in-memory cache, 503 when SQLite or LLM fails).
  • Add Prometheus client_golang backend via new internal/obs package: docsiq_http_requests_total, docsiq_http_request_duration_seconds, docsiq_pipeline_stage_duration_seconds, docsiq_embed_latency_seconds, docsiq_llm_tokens_total, docsiq_workq_depth, docsiq_workq_rejected_total, docsiq_build_info, plus client_golang defaults. Workq gains Pool.Stats() + a rejectedTotal counter.
  • Extend loggingMiddleware to emit one structured access-log line per request with req_id, method, path, route, status, duration_ms, bytes_out, auth (bearer|cookie|anon), project, panic. Emission is deferred so panics escaping recoveryMiddleware still produce an entry.
  • Add log.format=text|json config (default text; DOCSIQ_LOG_FORMAT env). JSON handler is wrapped by obs.NewProductionHandler which strips leading emoji from Record.Message.
  • Auth middleware bypass list extended with /healthz, /readyz, /metrics, /api/version.

Test plan

  • CGO_ENABLED=1 go test -tags sqlite_fts5 -timeout 300s ./... all pass
  • CGO_ENABLED=1 go test -tags sqlite_fts5 -race -timeout 300s ./... all pass
  • CGO_ENABLED=1 go vet -tags sqlite_fts5 ./... clean
  • Smoke test live server: /api/version, /healthz, /readyz, /metrics all return expected shapes; /readyz correctly reports 503 when LLM endpoint unreachable
  • docsiq_http_request_duration_seconds uses bounded route="GET /healthz" labels (Go 1.22 r.Pattern) rather than raw paths
  • UI tests not run (node_modules not installed in this worktree; UI untouched by this block)

Notes / follow-ups

  • LLM token counter uses a byte/4 approximation (kind="total"); threading real usage from langchaingo's GenerationInfo is a future follow-up. Flagged in commit message + source comment.
  • Pipeline stage timing wraps IndexPath/IndexURL/Finalize via a nil-safe timeStage helper. Fine-grained per-phase (load/chunk/embed/extract/community) wrapping would require restructuring indexFile's nested Phase 1a/1b/1c/2 logic; deferred as a follow-up.
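
The nil-safe helper mentioned in the last note can be sketched as below. This is only an illustration of the pattern: `stageRecorder` and its methods are hypothetical stand-ins for the PR's stage-duration histogram, and the stage names are made up. The point is that a method with a pointer receiver can guard against a nil receiver, so call sites need no `if obs != nil` check.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// stageRecorder is a hypothetical stand-in for a stage-duration histogram;
// it only counts observations per stage name.
type stageRecorder struct {
	mu     sync.Mutex
	counts map[string]int
}

// observe is safe to call on a nil *stageRecorder, so callers need no guard.
func (s *stageRecorder) observe(stage string, _ time.Duration) {
	if s == nil {
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.counts == nil {
		s.counts = make(map[string]int)
	}
	s.counts[stage]++
}

// count reports how many observations were recorded for a stage.
func (s *stageRecorder) count(stage string) int {
	if s == nil {
		return 0
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[stage]
}

// timeStage runs fn, records its wall-clock duration under stage, and
// returns fn's error unchanged.
func timeStage(rec *stageRecorder, stage string, fn func() error) error {
	start := time.Now()
	err := fn()
	rec.observe(stage, time.Since(start))
	return err
}

func main() {
	rec := &stageRecorder{}
	_ = timeStage(rec, "index_path", func() error { return nil })
	// A nil recorder is accepted; the error still passes through.
	err := timeStage(nil, "finalize", func() error { return errors.New("boom") })
	fmt.Println(rec.count("index_path"), err)
}
```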

🤖 Generated with Claude Code

@aksOps aksOps enabled auto-merge (squash) April 24, 2026 02:54
aksOps and others added 6 commits April 24, 2026 03:08
Moves the shared version-resolution logic out of cmd into a new
internal/buildinfo package so internal/api can serve the same data
without a cmd import cycle. Adds a public GET /api/version endpoint
that returns {version, commit, build_date, go_version, dirty, deps}
as JSON. Makefile LDFLAGS retargeted at the new package path.
bearerAuthMiddleware public bypass list extended with /healthz,
/readyz, and /api/version so upcoming probes and version endpoint
remain public.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/healthz is dependency-free liveness. /readyz checks SQLite PingContext
and an LLM provider Complete(maxTokens=1) reach, caching the verdict
for 10 seconds to absorb Prometheus + Kubernetes probe loops. Nil
provider (config provider: none) reports llm.status=skipped and keeps
readiness green. Legacy /health route remains as a 200-returning alias
for older clients; /healthz is the canonical probe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
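
The 10-second readiness cache described above could look roughly like the following. All type and function names here are hypothetical, and the probes are stubs rather than real SQLite or LLM calls. A nil probe is treated as "skipped", matching the nil-provider behaviour in the commit message, and an injectable clock keeps the sketch testable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// probeFunc checks one dependency; a nil probeFunc means "skipped".
type probeFunc func(ctx context.Context) error

// readyCache remembers the last verdict for ttl so that frequent probe
// loops do not hit the dependencies on every request.
type readyCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time
	probes  []probeFunc
	checked time.Time
	err     error
}

// check returns the cached verdict while it is fresh, otherwise re-probes.
// Holding the mutex while probing serializes concurrent callers.
func (c *readyCache) check(ctx context.Context) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.checked.IsZero() && c.now().Sub(c.checked) < c.ttl {
		return c.err
	}
	var err error
	for _, p := range c.probes {
		if p == nil {
			continue // dependency not configured: skipped, stays green
		}
		err = errors.Join(err, p(ctx))
	}
	c.checked, c.err = c.now(), err
	return err
}

// statusFor maps the verdict to the HTTP status /readyz would return.
func statusFor(err error) int {
	if err != nil {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// probeRuns calls check the given number of times, advancing a fake clock
// by stepSeconds after each call, and reports how often the probe ran.
func probeRuns(calls, stepSeconds int) int {
	clock := time.Unix(1, 0)
	runs := 0
	c := &readyCache{
		ttl: 10 * time.Second,
		now: func() time.Time { return clock },
		probes: []probeFunc{
			nil, // e.g. no LLM provider configured
			func(context.Context) error { runs++; return nil },
		},
	}
	for i := 0; i < calls; i++ {
		_ = c.check(context.Background())
		clock = clock.Add(time.Duration(stepSeconds) * time.Second)
	}
	return runs
}

func main() {
	fmt.Println(probeRuns(5, 1), probeRuns(3, 10), statusFor(nil))
}
```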
Replaces the ad-hoc text-format collector in internal/api/metrics.go
with the official prometheus/client_golang v1.20.5. Adds a new
internal/obs package hosting the Default registry and per-subject
collectors (HTTP, pipeline stages, embed latency, LLM tokens, workq
depth + rejections, build info).

Workq gains a Pool.Stats() snapshot accessor with rejectedTotal
counter; cmd/serve.go initialises obs and binds the live pool stats
provider. Pipeline IndexPath/IndexURL/Finalize are wrapped with
TimeStage for granular stage timings via a nil-safe helper.
Embedder observes per-batch provider latency; LLM Complete records
an approximate token count (bytes/4) until langchaingo usage data
is threaded through the Provider interface (tracked as follow-up).

HTTP recording moved out of loggingMiddleware's bespoke collector
into obs.HTTP.Observe; uses the Go 1.22 r.Pattern route to bound
label cardinality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
loggingMiddleware now emits one JSON/text slog line per request with
{req_id, method, path, route, status, duration_ms, bytes_out, auth,
project, panic}. The emission is deferred so a panic escaping
recoveryMiddleware still produces an access-log entry. auth is a
coarse label (bearer|cookie|anon) because docsiq uses a single shared
API key; there is no real user identity. responseWriter now tracks
bytes via an overridden Write and also proxies Flush for SSE/
streaming handlers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
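
A rough sketch of the two mechanisms in this commit: a wrapper that counts bytes and forwards Flush, and a deferred emission that still fires when the handler panics. The `entry` struct, the `emit` callback, and the choice to re-raise the panic are assumptions made for illustration; the PR's middleware logs through slog and its interaction with recoveryMiddleware is not shown on this page.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// entry holds a subset of the access-log fields named in the commit message.
type entry struct {
	Method, Path string
	Status       int
	BytesOut     int
	Duration     time.Duration
	Panic        bool
}

// responseWriter records the status and byte count and forwards Flush.
type responseWriter struct {
	http.ResponseWriter
	status int
	bytes  int
}

func (w *responseWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

func (w *responseWriter) Write(p []byte) (int, error) {
	if w.status == 0 {
		w.status = http.StatusOK // implicit 200 on first write
	}
	n, err := w.ResponseWriter.Write(p)
	w.bytes += n
	return n, err
}

// Flush lets SSE/streaming handlers reach the underlying writer's Flusher.
func (w *responseWriter) Flush() {
	if f, ok := w.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}

// accessLog emits exactly one entry per request from a deferred function,
// so a panicking handler is still logged before the panic continues upward.
func accessLog(next http.Handler, emit func(entry)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rw := &responseWriter{ResponseWriter: w}
		start := time.Now()
		defer func() {
			e := entry{Method: r.Method, Path: r.URL.Path, Status: rw.status,
				BytesOut: rw.bytes, Duration: time.Since(start)}
			rec := recover()
			e.Panic = rec != nil
			emit(e)
			if rec != nil {
				panic(rec) // re-raise for whatever recovery sits above
			}
		}()
		next.ServeHTTP(rw, r)
	})
}

func helloHandler(w http.ResponseWriter, _ *http.Request) { fmt.Fprint(w, "hello") }
func panicHandler(http.ResponseWriter, *http.Request)     { panic("boom") }

// serve runs one request through accessLog and returns the emitted entry.
func serve(h http.HandlerFunc) (got entry) {
	defer func() { _ = recover() }() // swallow the re-raised panic for the demo
	req := httptest.NewRequest(http.MethodGet, "/x", nil)
	accessLog(h, func(e entry) { got = e }).ServeHTTP(httptest.NewRecorder(), req)
	return got
}

func main() {
	ok, bad := serve(helloHandler), serve(panicHandler)
	fmt.Println(ok.Status, ok.BytesOut, ok.Panic, bad.Panic)
}
```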
New log.format config key (default "text"; DOCSIQ_LOG_FORMAT env).
Precedence is --log-format > env > config > default. The json handler
is wrapped in obs.NewProductionHandler, which strips a leading emoji
from slog Record.Message so log aggregators do not have to special-
case multi-byte sequences. The text handler keeps emoji for human
readers. Adds config-level defaults + env binding and three load-
level tests covering default, YAML, and env-var precedence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
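
One way to build such a wrapper with log/slog is sketched below. The emoji heuristic (Unicode category So plus the variation selector and zero-width joiner) is an assumption; this page does not show how obs.NewProductionHandler actually detects emoji, and the heuristic would also strip other leading symbols such as "©". `emojiStripHandler` and `newLogger` are illustrative names.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"strings"
	"unicode"
)

// stripLeadingEmoji removes leading symbol runes (Unicode category So),
// variation selectors and zero-width joiners, then the spaces after them.
// Messages without such a prefix are returned unchanged.
func stripLeadingEmoji(msg string) string {
	rest := strings.TrimLeftFunc(msg, func(r rune) bool {
		return unicode.Is(unicode.So, r) || r == '\uFE0F' || r == '\u200D'
	})
	if len(rest) == len(msg) {
		return msg
	}
	return strings.TrimLeft(rest, " ")
}

// emojiStripHandler wraps another slog.Handler and cleans Record.Message.
// Enabled is promoted from the embedded handler.
type emojiStripHandler struct{ slog.Handler }

func (h emojiStripHandler) Handle(ctx context.Context, r slog.Record) error {
	r.Message = stripLeadingEmoji(r.Message)
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap so derived loggers keep the stripping.
func (h emojiStripHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return emojiStripHandler{h.Handler.WithAttrs(attrs)}
}

func (h emojiStripHandler) WithGroup(name string) slog.Handler {
	return emojiStripHandler{h.Handler.WithGroup(name)}
}

// newLogger picks the handler by format; only "json" strips emoji.
func newLogger(format string) *slog.Logger {
	if format == "json" {
		return slog.New(emojiStripHandler{slog.NewJSONHandler(os.Stdout, nil)})
	}
	return slog.New(slog.NewTextHandler(os.Stdout, nil))
}

func main() {
	newLogger("json").Info("🚀 server started", "addr", ":8080")
	newLogger("text").Info("🚀 server started", "addr", ":8080")
}
```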
…erface

Block 3 (merged to main) added BatchCeiling() int to llm.Provider.
After rebasing feat/block4-observability onto main, the mock provider
needed the new method to compile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aksOps aksOps force-pushed the feat/block4-observability branch from 9fabea4 to 234bcf1 April 24, 2026 03:13
socket-security Bot commented Apr 24, 2026

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | golang/github.com/prometheus/client_golang@v1.20.5 | 72 | 100 | 100 | 100 | 100 |

aksOps commented Apr 24, 2026

@codex review

http.TimeoutHandler buffers the response body and does not implement
http.Flusher. Block 4's responseWriter wrapper grew a Flush() method,
so SSE handlers now pass the Flusher type-assertion and enter their
streaming loop — but every Flush is a no-op because the underlying
timeoutWriter absorbs writes until the request completes. The client
times out reading the body at 30s (matches cfg.Server.RequestTimeout).

Carve GET /api/upload/progress and GET /mcp out of the timeout wrapper
so Flush propagates to the real net/http writer. SSE teardown still
runs via r.Context() cancellation on client disconnect or shutdown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
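
The carve-out described in this commit can be sketched as follows. `isStreamingRoute` matches the test names quoted later in the PR, but its body, the `requestTimeout` wrapper, and the `/api/search` path are assumptions for illustration. The sketch relies on the documented behaviour that http.TimeoutHandler does not support the Flusher interface.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// isStreamingRoute reports whether the request targets one of the two
// streaming endpoints named in the commit message.
func isStreamingRoute(r *http.Request) bool {
	if r.Method != http.MethodGet {
		return false
	}
	switch r.URL.Path {
	case "/api/upload/progress", "/mcp":
		return true
	}
	return false
}

// requestTimeout wraps next in http.TimeoutHandler except for streaming
// routes, which are served directly so Flush reaches the real writer.
func requestTimeout(next http.Handler, d time.Duration) http.Handler {
	timed := http.TimeoutHandler(next, d, "request timed out")
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isStreamingRoute(r) {
			next.ServeHTTP(w, r)
			return
		}
		timed.ServeHTTP(w, r)
	})
}

// flusherVisible reports whether a handler behind requestTimeout can see
// an http.Flusher for the given GET path.
func flusherVisible(path string) bool {
	visible := false
	h := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		_, visible = w.(http.Flusher)
	})
	rec := httptest.NewRecorder() // the recorder itself implements Flusher
	req := httptest.NewRequest(http.MethodGet, path, nil)
	requestTimeout(h, time.Second).ServeHTTP(rec, req)
	return visible
}

func main() {
	fmt.Println(flusherVisible("/api/upload/progress")) // served directly
	fmt.Println(flusherVisible("/api/search"))          // hypothetical timed route
}
```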
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 234bcf1e34

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/api/router.go Outdated
Comment thread internal/api/health.go Outdated
… lazy sq pinger

Three fixes bundled, all flagged by post-rebase CI or Codex review:

1. waitUploadDone now consumes the /api/upload/progress SSE stream
   incrementally via bufio.Scanner instead of ReadAll. Prior code
   worked on main only because Flush() was missing on responseWriter,
   which made the handler return 500 immediately; with Block 4's
   Flusher now wired through, ReadAll blocks until the handler closes
   the stream, tripping the client's 30s timeout before "done" arrives.

2. /readyz SQLite probe no longer installs a no-op success fallback
   when stores.Get fails at router build time. A lazy pinger resolves
   the default store at probe time, so a genuine open failure
   (permissions, corruption, disk) surfaces as 503 — and a store that
   becomes available later flips readiness green without a restart.

3. readyzCache.check decouples probe context from the incoming request
   context via context.WithoutCancel. Previously, a probing client
   (Kubernetes, Prometheus, curl) disconnecting mid-probe would return
   context.Canceled and pollute the 10-second cache for every
   subsequent caller.

Coverage: new TestReadyz_ProbeCtxDecoupledFromRequestCtx exercises (3).
TestIsStreamingRoute_Classification and
TestRequestTimeoutMiddleware_StreamingRouteBypassesTimeout already
cover the Block-4-aware SSE bypass added in 8f9595b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
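
Fix 3 hinges on context.WithoutCancel (available since Go 1.21), which keeps the parent's values but drops its cancellation. A minimal sketch, with `runProbe`, `pingStub`, and the one-second timeout as illustrative assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runProbe detaches the probe from the caller's cancellation but still
// bounds it with its own timeout, so a disconnecting client cannot turn
// the result into context.Canceled.
func runProbe(reqCtx context.Context, timeout time.Duration, probe func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), timeout)
	defer cancel()
	return probe(ctx)
}

// pingStub stands in for a dependency check that honours its context,
// as a database ping would.
func pingStub(ctx context.Context) error {
	return ctx.Err()
}

// probeAfterDisconnect simulates a client that has already gone away and
// returns the probe result with or without the detached context.
func probeAfterDisconnect(detach bool) error {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // the probing client disconnected
	if detach {
		return runProbe(reqCtx, time.Second, pingStub)
	}
	return pingStub(reqCtx)
}

func main() {
	fmt.Println("detached:", probeAfterDisconnect(true))
	fmt.Println("coupled: ", probeAfterDisconnect(false))
}
```

Without the detachment, the cancelled result would be the value stored in the cache and served to every caller until the entry expires.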
aksOps commented Apr 24, 2026

@codex review

@aksOps aksOps merged commit 3169ccd into main Apr 24, 2026
11 checks passed
@aksOps aksOps deleted the feat/block4-observability branch April 24, 2026 04:16
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6947da5321

Comment thread internal/llm/provider.go
Comment on lines +129 to +133
resp, err := llms.GenerateFromSinglePrompt(ctx, p.llm, prompt, callOpts...)
if obs.LLM != nil {
	// Approximation: 1 token ~= 4 bytes of UTF-8 for English prose.
	// This is a coarse fallback until langchaingo's GenerationInfo
	// usage data is threaded through the Provider interface
P2: Skip token accounting when completion fails

The token metric is updated unconditionally after GenerateFromSinglePrompt, so failed calls (for example context deadline, network errors, or upstream 5xx) still increment docsiq_llm_tokens_total using prompt length. That inflates usage/cost dashboards during outages and retries; only record tokens when the completion succeeds (or when real usage metadata is available).
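
The suggested ordering can be sketched as below. Everything here is hypothetical scaffolding: `tokenCounter` stands in for the Prometheus counter, `generateFunc` for the langchaingo call, and counting prompt plus response bytes is an assumption, since this page does not show exactly what the PR counts. Only the bytes/4 approximation comes from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// tokenCounter is a stand-in for a metrics counter; add is nil-safe.
type tokenCounter struct{ total int }

func (c *tokenCounter) add(n int) {
	if c != nil {
		c.total += n
	}
}

// approxTokens applies the bytes/4 approximation described in the PR.
func approxTokens(s string) int { return len(s) / 4 }

// generateFunc is a hypothetical stand-in for the provider call.
type generateFunc func(ctx context.Context, prompt string) (string, error)

// complete records tokens only after a successful generation, so failed
// calls and their retries do not inflate the counter.
func complete(ctx context.Context, gen generateFunc, prompt string, c *tokenCounter) (string, error) {
	resp, err := gen(ctx, prompt)
	if err != nil {
		return "", err // no accounting on failure
	}
	c.add(approxTokens(prompt) + approxTokens(resp))
	return resp, nil
}

// tokensAfter runs one completion that either fails or succeeds and
// returns the counter value afterwards.
func tokensAfter(fail bool) int {
	c := &tokenCounter{}
	gen := func(context.Context, string) (string, error) {
		if fail {
			return "", errors.New("upstream 503")
		}
		return "four", nil
	}
	_, _ = complete(context.Background(), gen, "12345678", c)
	return c.total
}

func main() {
	fmt.Println(tokensAfter(true), tokensAfter(false))
}
```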

Comment thread internal/api/metrics.go
Comment on lines +24 to 26
_ *project.Registry,
_ *projectStores,
_ *config.Config,
P2: Preserve project and note gauges in metrics handler

This rewrite ignores registry and stores and serves only the shared Prometheus registry, but no collector now emits docsiq_projects_total or docsiq_notes_total. Those gauges were previously exported from /metrics, so this change removes per-project visibility and can break existing dashboards/alerts that depend on those series.
